Introduction

We are designing a dynamic analysis system for a database containing clinical, genomic, and proteomic data from 200 patients with Myotonic Dystrophy Type 1 (DM1), collected across 25 hospitals and entered by 13 different personnel.

Before building complex pipelines, any competent data team starts the same way: look at the data. This document walks through a short sequence of exploratory analyses that progress from simple descriptive plots to time-to-event modelling—the kind of work the system’s analytical layer (interactive notebooks, parameterised modules, materialised views) would enable from day one.

Why these analyses matter for DM1

Myotonic Dystrophy Type 1 is the most common adult-onset muscular dystrophy. It is a multi-systemic disorder: beyond progressive muscle weakness, patients experience cardiac conduction defects, endocrine abnormalities, cataracts, and cognitive impairment. A well-designed analytical system must therefore support exploration across multiple clinical domains and detect subtle longitudinal changes that unfold over months to years.

Data source

We use the publicly available random.cdisc.data package from the pharmaverse ecosystem, which generates synthetic CDISC ADaM datasets that mirror the structure of a real multi-centre clinical trial.

Dataset CDISC domain Description Key variables
cadsl ADSL Subject-Level Analysis AGE, SEX, RACE, ARM, BMRKR1/2
cadvs ADVS Vital Signs Analysis SYSBP, DIABP, PULSE per visit
cadlb ADLB Laboratory Analysis ALT, CRP, IGA per visit
cadaette ADAETTE Time-to-Adverse-Event Analysis Time, event/censor, by arm

Environment setup

All dependencies are declared in the project’s Nix flake, so inside nix develop every package below is available without manual installation.

# Install helper packages if not already present
for (pkg in c("gridExtra", "broom", "scales", "plotly", "htmltools", "webshot2")) {
  if (!requireNamespace(pkg, quietly = TRUE))
    install.packages(pkg, repos = "https://cloud.r-project.org", quiet = TRUE)
}

library(tidyverse)
library(survival)
library(random.cdisc.data)
library(gridExtra)
library(broom)
library(scales)

# Detect output format: interactive plotly for HTML, static ggplot for markdown
is_html <- knitr::is_html_output()
if (is_html) {
  library(plotly)
  library(htmltools)
}
set.seed(42)

1 — Data loading and preparation

We load four cached CDISC datasets and immediately inspect their dimensions.

data("cadsl");    adsl    <- cadsl
data("cadvs");    advs    <- cadvs
data("cadlb");    adlb    <- cadlb
data("cadaette"); adaette <- cadaette

tibble(
  Dataset  = c("ADSL", "ADVS", "ADLB", "ADAETTE"),
  Rows     = c(nrow(adsl), nrow(advs), nrow(adlb), nrow(adaette)),
  Columns  = c(ncol(adsl), ncol(advs), ncol(adlb), ncol(adaette))
) %>% knitr::kable(align = "lrr")
Dataset Rows Columns
ADSL 400 55
ADVS 16800 87
ADLB 8400 102
ADAETTE 3600 66

The cached data ships with 400 subjects. Our DM1 study has 200, so we subsample to match the intended scale. In a production system this step would not exist—the warehouse query would simply return the real cohort.

selected_ids <- adsl %>%
  distinct(USUBJID) %>%
  slice_sample(n = 200) %>%
  pull(USUBJID)

adsl    <- adsl    %>% filter(USUBJID %in% selected_ids)
advs    <- advs    %>% filter(USUBJID %in% selected_ids)
adlb    <- adlb    %>% filter(USUBJID %in% selected_ids)
adaette <- adaette %>% filter(USUBJID %in% selected_ids)

After subsampling we have 200 patients across 63 sites and 3 treatment arms (A: Drug X, B: Placebo, C: Combination).


2 — Demographic overview by treatment arm

Complexity level: descriptive / univariate.

The first step in any clinical analysis is understanding who is in the study. In a multi-centre trial we need to verify that treatment arms are balanced with respect to key covariates—imbalances in age or sex could confound every downstream comparison. In DM1 specifically, age at onset correlates with CTG repeat length, so any distributional skew is a red flag.

We show two complementary views:

# Patient counts by arm and sex
demo_summary <- adsl %>%
  count(ARM, SEX, name = "n_patients") %>%
  mutate(ARM = fct_reorder(ARM, n_patients, .fun = sum))

p1_left <- ggplot(demo_summary, aes(x = ARM, y = n_patients, fill = SEX)) +
  geom_col(position = "dodge", width = 0.7, colour = "grey30", linewidth = 0.3) +
  scale_fill_manual(
    values = c("F" = "#E07B91", "M" = "#6BAED6"),
    labels = c("F" = "Female", "M" = "Male")
  ) +
  labs(x = NULL, y = "Number of patients", fill = "Sex") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1),
        legend.position = "top")

# Age density by arm
p1_right <- ggplot(adsl, aes(x = AGE, fill = ARM, colour = ARM)) +
  geom_density(alpha = 0.25, linewidth = 0.6) +
  labs(x = "Age (years)", y = "Density",
       fill = "Treatment arm", colour = "Treatment arm") +
  theme_minimal(base_size = 11) +
  theme(legend.position = "top")

if (is_html) {
  subplot(
    ggplotly(p1_left,  tooltip = c("x", "y", "fill")) %>%
      layout(legend = list(orientation = "h", y = 1.12)),
    ggplotly(p1_right, tooltip = c("x", "y", "fill")) %>%
      layout(legend = list(orientation = "h", y = 1.12)),
    nrows = 1, shareY = FALSE, titleX = TRUE, titleY = TRUE, margin = 0.06
  ) %>%
    layout(
      title = list(text = "Demographic Overview by Treatment Arm", x = 0.5),
      margin = list(t = 80)
    ) %>%
    add_plotly_config()
} else {
  grid.arrange(p1_left, p1_right, ncol = 2)
}

Fig. 1. Demographic overview by treatment arm. Left: patient counts by arm and sex. Right: overlaid age density curves by treatment arm.

What we see: The three arms are roughly balanced in size and sex ratio. Age distributions overlap substantially, suggesting randomisation achieved its goal. A formal test (e.g. ANOVA on age, chi-square on sex) could confirm this, but for an exploratory pass the visual check is sufficient.


3 — Vital signs trajectories over time

Complexity level: longitudinal / time-related.

DM1 is a multi-systemic disease with well-documented cardiovascular involvement—cardiac conduction defects, arrhythmias, and in some cohorts altered blood pressure regulation. Monitoring systolic blood pressure (SBP) across scheduled study visits is therefore clinically meaningful.

We use a spaghetti-plus-mean design:

This layered approach lets the reader simultaneously assess heterogeneity and treatment-level patterns without suppressing the underlying data.

sbp <- advs %>%
  filter(
    PARAMCD == "SYSBP",
    AVISIT  != "",
    !is.na(AVAL),
    !is.na(AVISITN)
  ) %>%
  select(USUBJID, ARM, AVISIT, AVISITN, AVAL)

# Summary statistics per arm per visit
sbp_summary <- sbp %>%
  group_by(ARM, AVISIT, AVISITN) %>%
  summarise(
    mean_val = mean(AVAL, na.rm = TRUE),
    se       = sd(AVAL, na.rm = TRUE) / sqrt(n()),
    n        = n(),
    .groups  = "drop"
  ) %>%
  mutate(
    lo = mean_val - 1.96 * se,
    hi = mean_val + 1.96 * se
  )

p2_static <- ggplot() +
  geom_line(
    data = sbp,
    aes(x = AVISITN, y = AVAL, group = USUBJID),
    alpha = 0.08, linewidth = 0.3, colour = "grey50"
  ) +
  geom_ribbon(
    data = sbp_summary,
    aes(x = AVISITN, ymin = lo, ymax = hi, fill = ARM),
    alpha = 0.25
  ) +
  geom_line(
    data = sbp_summary,
    aes(x = AVISITN, y = mean_val, colour = ARM),
    linewidth = 0.9
  ) +
  geom_point(
    data = sbp_summary,
    aes(x = AVISITN, y = mean_val, colour = ARM),
    size = 1.8
  ) +
  facet_wrap(~ ARM, nrow = 1) +
  scale_x_continuous(breaks = sort(unique(sbp_summary$AVISITN))) +
  labs(
    x = "Analysis visit number",
    y = "Systolic BP (Pa)",
    colour = "Arm", fill = "Arm"
  ) +
  theme_minimal(base_size = 11) +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold"))

if (is_html) {
  arm_colours <- setNames(
    scales::hue_pal()(n_distinct(sbp_summary$ARM)),
    levels(sbp_summary$ARM)
  )

  p2_ly <- plot_ly()
  for (arm in unique(sbp_summary$ARM)) {
    arm_sbp  <- sbp %>% filter(ARM == arm)
    arm_summ <- sbp_summary %>% filter(ARM == arm) %>% arrange(AVISITN)

    # Individual traces (light, no legend entry)
    for (uid in unique(arm_sbp$USUBJID)) {
      d <- arm_sbp %>% filter(USUBJID == uid)
      p2_ly <- p2_ly %>% add_lines(
        data = d, x = ~AVISITN, y = ~AVAL,
        line = list(color = "rgba(160,160,160,0.12)", width = 0.8),
        hoverinfo = "text",
        text = ~paste0("Patient: ", USUBJID, "<br>Visit: ", AVISIT,
                       "<br>SBP: ", round(AVAL, 1)),
        showlegend = FALSE, legendgroup = arm
      )
    }

    # 95% CI ribbon
    p2_ly <- p2_ly %>%
      add_ribbons(
        data = arm_summ, x = ~AVISITN, ymin = ~lo, ymax = ~hi,
        fillcolor = paste0(arm_colours[arm], "33"),
        line = list(color = "transparent"),
        showlegend = FALSE, legendgroup = arm
      )

    # Mean line + points
    p2_ly <- p2_ly %>%
      add_trace(
        data = arm_summ, x = ~AVISITN, y = ~mean_val,
        type = "scatter", mode = "lines+markers",
        line = list(color = arm_colours[arm], width = 2.5),
        marker = list(color = arm_colours[arm], size = 7),
        name = arm, legendgroup = arm,
        hoverinfo = "text",
        text = ~paste0(arm, "<br>Visit: ", AVISIT,
                       "<br>Mean SBP: ", round(mean_val, 1),
                       "<br>95% CI: [", round(lo, 1), ", ", round(hi, 1), "]",
                       "<br>n = ", n)
      )
  }

  p2_ly %>% layout(
    title  = list(text = "Systolic Blood Pressure Trajectories Over Visits"),
    xaxis  = list(title = "Analysis visit number"),
    yaxis  = list(title = "Systolic BP (Pa)"),
    legend = list(orientation = "h", y = -0.15),
    hovermode = "closest"
  ) %>%
  add_plotly_config()
} else {
  p2_static
}

Fig. 2. Systolic blood pressure trajectories over scheduled study visits. Individual patient traces (grey) are overlaid with group mean ± 95 % CI by treatment arm.

What we see: Mean SBP stays relatively stable across visits in all three arms, with wide individual variability (a realistic pattern for a heterogeneous disease). In a real DM1 dataset we would also overlay cardiac conduction metrics (PR interval, QTc) from the ECG domain—cross-modal exploration that the proposed system is designed to support.


4 — Laboratory biomarker change from baseline

Complexity level: longitudinal / safety monitoring.

In DM1, hepatic and inflammatory markers deserve close surveillance:

Showing change from baseline (CHG) rather than raw values lets us focus on within-patient shifts—the natural framing for a longitudinal study and for the materialised views we proposed in the analysis system (the patient-level summary view stores exactly this kind of derived quantity).

lab_chg <- adlb %>%
  filter(
    PARAMCD %in% c("ALT", "CRP"),
    AVISIT  != "",
    !is.na(CHG),
    !is.na(AVISITN)
  ) %>%
  select(USUBJID, ARM, PARAMCD, PARAM, AVISIT, AVISITN, CHG) %>%
  mutate(AVISIT = fct_reorder(AVISIT, AVISITN))

p3_static <- ggplot(lab_chg, aes(x = AVISIT, y = CHG, fill = ARM)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
  geom_boxplot(
    outlier.size = 0.7, outlier.alpha = 0.4,
    linewidth = 0.35, width = 0.7,
    position = position_dodge(width = 0.8)
  ) +
  facet_wrap(~ PARAM, scales = "free_y", ncol = 1) +
  labs(
    x    = "Scheduled visit",
    y    = "Change from baseline",
    fill = "Treatment arm"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    axis.text.x = element_text(angle = 35, hjust = 1, size = 8),
    legend.position = "top",
    strip.text = element_text(face = "bold", size = 11)
  )

if (is_html) {
  params <- unique(lab_chg$PARAMCD)
  p3_panels <- lapply(params, function(pc) {
    d <- lab_chg %>% filter(PARAMCD == pc)
    plot_ly(
      data = d, x = ~AVISIT, y = ~CHG, color = ~ARM,
      type = "box",
      hoverinfo = "y+name",
      showlegend = (pc == params[1])
    ) %>%
      layout(
        boxmode = "group",
        annotations = list(
          text = unique(d$PARAM), xref = "paper", yref = "paper",
          x = 0.5, y = 1.06, showarrow = FALSE,
          font = list(size = 14, face = "bold")
        ),
        shapes = list(list(
          type = "line", x0 = 0, x1 = 1, xref = "paper",
          y0 = 0, y1 = 0, line = list(color = "grey", dash = "dash")
        ))
      )
  })

  subplot(p3_panels, nrows = length(params), shareX = TRUE, titleY = TRUE) %>%
    layout(
      title  = list(text = "Change from Baseline in ALT and CRP Over Visits"),
      yaxis  = list(title = "Change from baseline"),
      yaxis2 = list(title = "Change from baseline"),
      boxmode = "group",
      legend  = list(orientation = "h", y = -0.1),
      margin  = list(t = 70)
    ) %>%
    add_plotly_config()
} else {
  p3_static
}

Fig. 3. Change from baseline in ALT and CRP across study visits by treatment arm. The dashed line marks zero change; boxes show the inter-quartile range with whiskers extending to 1.5 × IQR.

What we see: Both ALT and CRP change distributions remain centred around zero with no obvious arm-level divergence—a reassuring safety signal. The inter-quartile ranges widen slightly at later visits, consistent with increasing variability as follow-up time grows and some patients are lost. In a real DM1 study, we would flag individual patients whose ALT exceeds 3\(\times\) ULN (Hy’s Law boundary) and join their data with the genomic modality to look for variant-level associations—exactly the cross-modal workflow the system is designed to enable.


5 — Kaplan-Meier: time to first adverse event

Complexity level: time-to-event / inferential.

Time-to-event analysis is the most information-rich way to compare safety profiles across treatment arms. The Kaplan-Meier estimator gives non-parametric survival curves; the log-rank test provides a formal between-arm comparison.

In the proposed analysis system, this would be exposed as a parameterised module: the researcher selects the event of interest (any AE, serious AE, grade 3–5 AE), stratification variables, and optional covariates via a configuration file, and the pipeline produces the curves and test results automatically.

# Available time-to-event endpoints
tte_params <- adaette %>% distinct(PARAMCD, PARAM)
knitr::kable(tte_params, col.names = c("Code", "Description"))
Code Description
AEREPTTE Time to end of AE reporting period
AETOT1 Number of occurrences of any adverse event
AETOT2 Number of occurrences of any serious adverse event
AETOT3 Number of occurrences of a grade 3-5 adverse event
AETTE1 Time to first occurrence of any adverse event
AETTE2 Time to first occurrence of any serious adverse event
AETTE3 Time to first occurrence of a grade 3-5 adverse event
HYSTTEBL Time to Hy’s Law Elevation in relation to Baseline
HYSTTEUL Time to Hy’s Law Elevation in relation to ULN

We select AETTE1 (time to first occurrence of any adverse event)—the broadest safety endpoint.

tte_data <- adaette %>%
  filter(
    PARAMCD == "AETTE1",
    !is.na(AVAL),
    !is.na(CNSR)
  ) %>%
  select(USUBJID, ARM, AVAL, CNSR, PARAM) %>%
  distinct(USUBJID, .keep_all = TRUE) %>%
  mutate(
    event    = 1 - CNSR,
    time_wks = AVAL / 7
  )

event_label <- tte_data %>% pull(PARAM) %>% unique() %>% first()

# Kaplan-Meier fit
surv_obj <- Surv(time = tte_data$time_wks, event = tte_data$event)
km_fit   <- survfit(surv_obj ~ ARM, data = tte_data)

# Log-rank test
lr_test <- survdiff(surv_obj ~ ARM, data = tte_data)
lr_pval <- pchisq(lr_test$chisq, df = length(lr_test$n) - 1, lower.tail = FALSE)

Events observed: 125 / 200 patients experienced the event. Log-rank p-value: 0.0446.

km_tidy <- tidy(km_fit) %>%
  mutate(ARM = str_remove(strata, "^ARM="))

# Number-at-risk at evenly spaced time points
risk_times   <- seq(0, max(km_tidy$time, na.rm = TRUE), length.out = 6) %>% round(1)
risk_summary <- summary(km_fit, times = risk_times)
risk_tbl     <- tibble(
  time   = risk_summary$time,
  n.risk = risk_summary$n.risk,
  ARM    = str_remove(risk_summary$strata, "^ARM=")
)

if (is_html) {
  arm_colours <- setNames(
    scales::hue_pal()(n_distinct(km_tidy$ARM)),
    unique(km_tidy$ARM)
  )

  p4_ly <- plot_ly()
  for (arm in unique(km_tidy$ARM)) {
    d <- km_tidy %>% filter(ARM == arm) %>% arrange(time)

    # CI ribbon
    p4_ly <- p4_ly %>%
      add_ribbons(
        data = d, x = ~time, ymin = ~conf.low, ymax = ~conf.high,
        fillcolor = paste0(arm_colours[arm], "22"),
        line = list(color = "transparent"),
        showlegend = FALSE, legendgroup = arm,
        hoverinfo = "skip"
      )

    # Step curve
    p4_ly <- p4_ly %>%
      add_trace(
        data = d, x = ~time, y = ~estimate,
        type = "scatter", mode = "lines",
        line = list(color = arm_colours[arm], width = 2.2, shape = "hv"),
        name = arm, legendgroup = arm,
        hoverinfo = "text",
        text = ~paste0(
          arm,
          "<br>Time: ", round(time, 1), " weeks",
          "<br>Event-free: ", scales::percent(estimate, accuracy = 0.1),
          "<br>95% CI: [", scales::percent(conf.low, accuracy = 0.1),
          ", ", scales::percent(conf.high, accuracy = 0.1), "]",
          "<br>n.risk: ", n.risk,
          "<br>n.event: ", n.event
        )
      )

    # Censor tick marks
    censored <- d %>% filter(n.censor > 0)
    if (nrow(censored) > 0) {
      p4_ly <- p4_ly %>%
        add_markers(
          data = censored, x = ~time, y = ~estimate,
          marker = list(symbol = "line-ns", size = 8,
                        color = arm_colours[arm], line = list(width = 1.5)),
          showlegend = FALSE, legendgroup = arm,
          hoverinfo = "text",
          text = ~paste0("Censored (n=", n.censor, ")")
        )
    }
  }

  p4_ly %>% layout(
    title   = list(text = paste0(
      "Kaplan-Meier: Time to First Adverse Event",
      "<br><sup>Log-rank p = ", sprintf("%.4f", lr_pval), "</sup>"
    )),
    xaxis   = list(title = "Time (weeks)"),
    yaxis   = list(title = "Event-free probability",
                   tickformat = ".0%", range = c(0, 1)),
    legend  = list(orientation = "h", y = -0.15),
    hovermode = "closest",
    margin  = list(t = 80)
  ) %>%
  add_plotly_config()
} else {
  # Static fallback for github_document
  p_km <- ggplot(km_tidy, aes(x = time, y = estimate, colour = ARM, fill = ARM)) +
    geom_step(linewidth = 0.8) +
    geom_rect(
      aes(
        xmin = time,
        xmax = lead(time, default = max(time)),
        ymin = conf.low, ymax = conf.high
      ),
      alpha = 0.10, colour = NA
    ) +
    annotate(
      "text",
      x = max(km_tidy$time) * 0.65, y = 0.15,
      label = sprintf("Log-rank p = %.4f", lr_pval),
      size = 3.5, fontface = "italic", colour = "grey30"
    ) +
    scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
    labs(
      x      = "Time (weeks)",
      y      = "Event-free probability",
      colour = "Treatment arm",
      fill   = "Treatment arm"
    ) +
    theme_minimal(base_size = 11) +
    theme(legend.position = "top")

  p_risk <- ggplot(risk_tbl, aes(x = time, y = ARM, label = n.risk)) +
    geom_text(size = 3) +
    labs(x = "Time (weeks)", y = NULL, title = "Number at risk") +
    theme_minimal(base_size = 9) +
    theme(
      panel.grid = element_blank(),
      plot.title = element_text(size = 9, face = "bold"),
      axis.text.y = element_text(face = "bold")
    )

  grid.arrange(p_km, p_risk, ncol = 1, heights = c(4, 1))
}

Fig. 4. Kaplan–Meier curves for time to first adverse event by treatment arm, with 95 % confidence bands and censoring tick marks. The log-rank test p-value is annotated.

What we see: The Kaplan-Meier curves separate modestly between arms, with a log-rank p-value of 0.0446—nominally significant at the 0.05 level. In a real DM1 study this would warrant further investigation with a Cox proportional-hazards model adjusting for baseline covariates (age, sex, disease severity, CTG repeat length), and the result would be cross-referenced with the genomic and proteomic modalities to identify biological correlates of adverse event risk.


Summary

The four analyses above trace a deliberate progression:

# Analysis Type System capability it demonstrates
1 Demographic overview Descriptive Fast queries over the ADSL materialised view
2 SBP trajectories Longitudinal / time Cross-visit exploration of vital signs
3 ALT & CRP change from BL Longitudinal / safety Change-from-baseline monitoring for dashboards
4 KM time to first AE Time-to-event / inferential Parameterised survival module for notebooks

In the proposed dynamic analysis system, each of these would be:


Reproducibility

This analysis runs inside the project’s Nix development shell. To reproduce:

# Enter the environment (all dependencies resolved automatically)
nix develop

# Render this document
Rscript -e 'rmarkdown::render("small_exploratory_analysis/exploratory_analysis.Rmd")'
Package Purpose
tidyverse Data wrangling and ggplot2 plotting
survival Kaplan-Meier estimation, log-rank test
random.cdisc.data Synthetic CDISC ADaM datasets
plotly Interactive HTML visualisations
gridExtra Multi-panel plot arrangement (fallback)
broom Tidy survival model output
scales Axis formatting
rmarkdown / knitr Document rendering